Twitter Author Profiling Using Word Embeddings and Logistic Regression

Authors

  • Liliya Akhtyamova
  • John Cardiff
  • Andrey Ignatov
Abstract

The general goal of the author profiling task is to determine various social and demographic aspects of an author based on their writing. In this work, we propose an approach that combines word embeddings and classical logistic regression to identify author gender and language variety from the author's tweets. The model was trained on the PAN 2017 Twitter corpus, which contains data in English, Spanish, Portuguese, and Arabic from more than 11 thousand authors. Due to its simplicity, the proposed solution can be treated as a baseline for both the gender and the language variety identification subtasks.
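The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how embeddings are obtained or aggregated, so the sketch assumes that each author is represented by the average of the embeddings of their tweet tokens, and that a binary logistic regression is fitted on those averages. The embedding table, the toy tweets, the label meanings, and all function names are hypothetical.

```python
import math

# Hypothetical 2-dimensional toy embeddings. A real system would use vectors
# trained on a large corpus; these values are invented for illustration.
EMBEDDINGS = {
    "colour": [1.0, 0.1],
    "lorry": [0.9, 0.2],
    "flat": [0.8, 0.0],
    "color": [0.1, 1.0],
    "truck": [0.0, 0.9],
    "apartment": [0.2, 0.8],
}
DIM = 2


def author_vector(tweets):
    """Average the embeddings of all known tokens across an author's tweets."""
    total = [0.0] * DIM
    count = 0
    for tweet in tweets:
        for token in tweet.lower().split():
            vec = EMBEDDINGS.get(token)
            if vec is not None:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
    # Authors with no known tokens get the zero vector.
    return [t / count for t in total] if count else total


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit a binary logistic regression with batch gradient descent."""
    w = [0.0] * DIM
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * DIM
        grad_b = 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, tweets):
    """Return the predicted label (0 or 1) for one author's tweets."""
    x = author_vector(tweets)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy training set: label 0 and label 1 stand for two language varieties.
train_authors = [
    (["my favourite colour", "the lorry is late"], 0),
    (["new flat colour"], 0),
    (["color of the truck"], 1),
    (["apartment color"], 1),
]
X = [author_vector(tweets) for tweets, _ in train_authors]
y = [label for _, label in train_authors]
w, b = train_logreg(X, y)

print(predict(w, b, ["the lorry"]))        # unseen author, variety-0 vocabulary
print(predict(w, b, ["truck apartment"]))  # unseen author, variety-1 vocabulary
```

The gradient descent loop is written out only to keep the sketch dependency-free; in practice a library classifier and pretrained or corpus-trained embeddings would take the place of the toy parts. The language variety subtask has more than two classes per language, so it would need a multiclass extension such as one-vs-rest.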

Similar resources

Author Profiling with Doc2vec Neural Network-Based Document Embeddings

To determine author demographics of texts in social media such as Twitter, blogs, and reviews, we use doc2vec document embeddings to train a logistic regression classifier. We experimented with age and gender identification on the PAN author profiling 2014–2016 corpora under both single- and cross-genre conditions. We show that under certain settings the neural network-based features outperform t...

Using TF-IDF n-gram and Word Embedding Cluster Ensembles for Author Profiling

This paper presents our approach and results for the 2017 PAN Author Profiling Shared Task. Language-specific corpora were provided for four languages: Spanish, English, Portuguese, and Arabic. Each corpus consisted of tweets authored by a number of Twitter users labeled with their gender and the specific variant of their language which was used in the documents (e.g. Brazilian or European Port...

Cross-Genre Age and Gender Identification in Social Media

This paper gives a brief description of the methods adopted for the author profiling task as part of the PAN 2016 competition [1]. Author profiling is the task of predicting the author’s age and gender from his/her writing. In this paper, we follow a two-level ensemble approach to tackle the cross-genre author profiling task where training documents and testing documents are from different g...

Language Variety and Gender Classification for Author Profiling in PAN 2017

We describe our method for the Author Profiling task. The task deals with the study of profile aspects such as gender and language variety. We explore an approach that uses high-order character n-grams as features and logistic regression as the classifier for all subtasks. This approach appears to be simple and effective for the task. We also investigated feature importances and low-dimensional embeddings of the d...

Improving Twitter Sentiment Classification Using Topic-Enriched Multi-Prototype Word Embeddings

It has been shown that learning distributed word representations is highly useful for Twitter sentiment classification. Most existing models rely on a single distributed representation for each word. This is problematic for sentiment classification because words are often polysemous and each word can contain different sentiment polarities under different topics. We address this issue by learnin...

Publication date: 2017